Data scarcity and AI solutions: Social research in new ways
20 Jan 2025
The social sciences rely on data from surveys or data about users. To obtain it, LMU researchers are increasingly relying on artificial intelligence and analyzing digital footprints.
Data is indispensable as the basis for serious social science or statistical research. It can provide information on how people feel about democracy in a society, how the labor market is changing, or how wide the gap is between rich and poor: It is the basis for studies whose results can even be very relevant for political decisions. If the data source is not reliable, that can have undesirable consequences. “In fact, in many surveys the sample is not a random sample from the population of interest, but some kind of a selective sample. This can distort the results if the aim is to describe this population,” explains sociologist Katrin Auspurg.
“We saw this in surveys conducted during the Covid-19 pandemic, for example, which often missed out older and less educated people. This meant not enough consideration was given — from a societal and political perspective — to a number of problems that these people had. At least some of the results found with the selective sample of younger and more educated people may not be generalizable to the whole population.”
In fact, in many surveys the sample is not a random sample from the population of interest, but some kind of a selective sample.
Katrin Auspurg, Chair of Quantitative Empirical Research at LMU
Problems with data collection
One problem that poses challenges for researchers in the social sciences is therefore of major societal relevance: the fact that data is becoming increasingly difficult to collect by traditional means. “Many of the people we want to recruit for a survey are no longer taking part in surveys at all,” says political scientist Professor Alexander Wuttke from LMU, whose research includes investigating the trends surrounding democracy in Germany. One of the problems he sees in the field of democracy research is a general increase in skepticism towards academic research.
Katrin Auspurg, whose academic work focuses on quantitative social research, can confirm this trend: “In the 1980s, 70% of people who were contacted for population surveys such as the ALLBUS took part. Now it’s only around 30%. Generally speaking, you can only expect a participation rate of one third in either face-to-face or postal surveys using random samples from municipal population registers. The rates for telephone surveys are frequently even lower.”
Many of the people we want to recruit for a survey are no longer taking part in surveys at all.
Alexander Wuttke, Professor of Digitalization and Political Behavior at LMU
Auspurg also identifies changes in the culture around doing surveys as a challenge. “The number of surveys being done has increased hugely because it’s now so much easier and cheaper to conduct them online.” Added to that, in many cases they are not even academic surveys but market research instead. Marketing calls, too, are often disguised as surveys.
Demographic aspects also influence access to people, says Auspurg. “It’s easier to get people on middle incomes, for example, to complete a survey. Less-educated individuals, older people, or those with a migration background, on the other hand, are harder to recruit for surveys.” If the data analysis does not correct for this, surveys suffer from what’s known as middle-class bias.
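The correction Auspurg alludes to is often done with post-stratification weighting: under-represented groups are up-weighted so the weighted sample matches known population shares, for instance from a census. A minimal sketch with invented numbers (all shares, group names, and scores below are hypothetical, purely for illustration):

```python
# Post-stratification weighting sketch (all numbers invented for illustration).
# Shares of education groups in the population (e.g. from a census) ...
population_share = {"low_edu": 0.30, "mid_edu": 0.50, "high_edu": 0.20}
# ... and in the (skewed, middle-class-heavy) survey sample.
sample_share = {"low_edu": 0.10, "mid_edu": 0.60, "high_edu": 0.30}

# Each group's weight is its population share divided by its sample share,
# so the under-represented low-education group gets weight 0.30 / 0.10 = 3.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical group means of some survey item (say, a satisfaction score).
group_mean = {"low_edu": 4.0, "mid_edu": 6.0, "high_edu": 7.0}

# The unweighted estimate reflects the skewed sample ...
unweighted = sum(sample_share[g] * group_mean[g] for g in group_mean)

# ... while the weighted estimate recovers the population-level figure.
weighted = sum(sample_share[g] * weights[g] * group_mean[g] for g in group_mean)

print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

With these made-up figures, the unweighted estimate (6.10) overstates the population value (5.60) because the groups with higher scores are over-represented in the sample.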
Alexander Wuttke points out that telephone surveys are also becoming increasingly difficult “because many households no longer use a landline and people don’t answer cell phone calls as often.” What this means, he goes on to say, is that many sections of the population are excluded and the claim to be a representative survey is pretty much an illusion.
“The word ‘representative’ is often a smokescreen, its meaning unclear. To be able to make a broad statement, you need a random sample that is as close as possible to the overall population,” says Katrin Auspurg. “But depending on the research question, it can make sense to step away from random sampling in order to study certain aspects in a more targeted manner.” For experimental research, for example, or if you’re trying to investigate things about specific hard-to-reach groups. But in that case, she explains, you need to clearly explain and justify the type of sample you used.
So, the challenges are numerous. But the researchers are full of ideas, actively accessing alternative data sources or drawing on digital data footprints. Alexander Wuttke explains, “When people use platforms, like X for example, and express their opinions there, they leave behind a digital footprint that can be seen by anyone and analyzed. This is an important trend in the research field, one that has emerged as an addition to survey-based research. For instance, we are able to see when users express anti-democratic opinions.” However, what this does not do is allow you to reflect a cross-section of society; at best, you would be able to analyze what users of X are saying.
Request for data donations
Nevertheless, Frauke Kreuter, a Professor at LMU’s Institute of Statistics, emphasizes the enormous potential of digital data. For one thing, she says, it is significantly cheaper than collecting data the traditional way. And it is easier, too, because you collect the data passively: You are less reliant on the respondents’ memory, and in a health study you can simply measure how many steps a person has taken instead of asking them how much exercise they have gotten over the past year.
What’s more, the EU-wide General Data Protection Regulation (GDPR), in force since 2018, opens up the possibility for people to make data donations to researchers: Online service providers must provide users with their data on request. “We can then ask these people whether they would like to give their data to us for research purposes — for citizen science, so to speak,” says Kreuter.
Unfortunately, when using digital data, people often don’t pay attention to the processes used to generate that data and where it comes from.
Frauke Kreuter, Chair of Statistics and Data Science in Social Sciences and the Humanities
Using digital data: Lessons to learn
However, digital data also has its secrets that have not (yet) been revealed. “Unfortunately, when using digital data, people often don’t pay attention to the processes used to generate that data and where it comes from.” Frauke Kreuter thinks one reason for this is that platform operators are reluctant to reveal how their algorithms work.
In addition to that, she says not enough is known about the social behaviors around using digital media. “There’s no account taken of when several people are using the same device. Or when women carry their smartphone not on their person but in their purse — which they sometimes put down. If you were trying to measure the number of steps they’d taken each day, the data would be incorrect.”
Digital data is also more difficult to analyze. “We just don’t have the number of data specialists we would need to get everything right with passive reading,” says Kreuter bluntly.
Another important source for research purposes, says the LMU researcher, could be administrative data, such as the kind collected by public institutions. Here, however, data protection tends to present an obstacle that makes it more difficult to gain access. As Frauke Kreuter explains, “In Denmark, for example, administrative registry data is accessible for research purposes. Population insights are also more precise. While some progress has been made with administrative research data centers, we still have a lot of catching up to do here in Germany.”
She thinks the reason for Germany’s reluctance lies in a general misunderstanding: “We tend to focus on protecting data instead of protecting people,” says Kreuter. “Safe use of data is possible.”
Give and take in cooperation with authorities
One way of still enabling access to data from public institutions could be through collaborations with benefits for both sides: Researchers could give administrative staff the tools to use their data to better manage their own processes, and to gain insights from it themselves. Conversely, the data itself could be made available to researchers for their work. Frauke Kreuter and her fellow researchers have already launched a promising initiative in the United States, where she continues to pursue research.
“In Germany, we recently worked with the Bavarian Ministry of Digitalization, the Directorate General of the State Archives, and the Bavarian State Ministry of Justice,” explains the professor. “We moved 60,000 folders into a secure cloud environment and showed the employees how to take samples from them, and other things, too.” At the same time, students and doctoral candidates were able to use this data to work on research questions, says Kreuter.
It’s all about the mix
All in all, the LMU researchers are certain that employing a mix of different approaches to data collection could help to close the gap resulting from the difficulties with traditional qualitative and quantitative surveys. Alexander Wuttke points out that in democracy research, discrepancies can be identified and classified through triangulation — the use of different data sources: “In surveys, people will say they value democracy, but at the same time we observe anti-democratic voting behavior. By combining different data sources, we are better able to understand why these discrepancies occur.”
Qualitative versus quantitative — AI can help
But traditional research practice, too, could soon be made much easier. “I think that artificial intelligence can help us to resolve or at least limit the conflict between in-depth qualitative research and quantitative research that relies on large volumes of data,” says Alexander Wuttke confidently.
According to the researcher, the qualitative approach is time consuming and researchers need to take the time to understand what people think or what motivates them to do this or that. “You always need interviewers who have to take their time. You can only do that up to a point and not hundreds or thousands of times. Large comparative studies across different states are therefore inconceivable.”
AI technology like large language models, or LLMs for short, could help here. Using them would make it possible to hold direct conversations, with questions and clarifications, and to react as the situation demands.
Alexander Wuttke and Frauke Kreuter have already initiated a pilot study on this with students, which is showing some very promising signs. But there are also disadvantages here — the lack of the empathy you get in face-to-face conversations. “You could probably do thousands of interviews a day, but AI can’t do it with the same empathy as a human,” admits Wuttke. It remains to be seen whether people will be as happy to answer an AI as they would be to answer a human.
Algorithms are trained on data. To do that, the AI needs high-quality data.
Frauke Kreuter
Traditional methods remain relevant
From the context of her interdisciplinary research, Frauke Kreuter knows that there is no either/or between the proven methods of social science research and the new possibilities that digitalization and AI open up for research: “Algorithms are trained on data. To do that, the AI needs high-quality data,” says the LMU researcher. “I foresee an increasing interest in traditional data collection in the social sciences, if only because AI requires good comparative data.”
Recognizing the quality of a study:
Random samples are the most meaningful type of sample. Here, every member of the target population has an equal chance of being selected, so the sample’s composition mirrors that of the group the researchers are seeking information about. If the question is average income in Germany, for example, a random sample will include working people of different ages, genders, and occupations.
Less meaningful are samples that survey people who volunteer themselves, as is often the case with online surveys. The findings cannot then be applied as a general statement of fact, but only to say something about the group that participated.
Importantly, any media that quote or refer to surveys should provide background information on how the surveys were conducted. For example, the sheer number of respondents is irrelevant if the survey is biased.
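The effect of self-selection described above can be made concrete with a small simulation. All numbers here are invented for illustration: a toy population in which income rises with age, and an opt-in sample in which younger people are assumed to be far more likely to volunteer.

```python
import random

random.seed(42)

# Toy population of 100,000 people; the age-income relationship is invented,
# purely for illustration.
population = []
for _ in range(100_000):
    age = random.randint(18, 80)
    income = 1500 + 40 * age + random.gauss(0, 500)
    population.append({"age": age, "income": income})

def mean_income(people):
    return sum(p["income"] for p in people) / len(people)

# A random sample of 1,000 lands close to the true population mean ...
random_sample = random.sample(population, 1000)

# ... while a self-selected ("opt-in") sample, where younger people are
# assumed far more likely to volunteer, drifts well away from it.
opt_in = [p for p in population
          if random.random() < (0.30 if p["age"] < 40 else 0.05)]

pop_mean = mean_income(population)
rand_mean = mean_income(random_sample)
optin_mean = mean_income(opt_in)
print(f"population: {pop_mean:.0f}, "
      f"random sample: {rand_mean:.0f}, opt-in sample: {optin_mean:.0f}")
```

Even though the opt-in sample here contains many more respondents than the random sample, its mean income is markedly too low, because the volunteers skew young: the sheer number of respondents does not repair a biased selection process.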